Abstract. Since its introduction, prediction-by-partial-matching (PPM)
has been a de facto gold standard in lossless text compression, and many
variants improving its compression ratio and speed have been proposed.
However, reducing the high space requirement of PPM schemes has received
little attention. This study focuses on reducing the memory consumption
of PPM via the recently proposed compressed context modeling (CCM), which
uses compressed representations of contexts in the statistical model.
In contrast to the classical definition of context as the string of
characters preceding a particular position, CCM considers the context to
be the amount of preceding information, that is, the bit-stream produced
by compressing the previous symbols. We observe that with CCM the data
structures, particularly the context trees, can be implemented in smaller
space, and we present a trade-off between the compression ratio and the
space requirement. The experiments showed that this trade-off is
especially beneficial in low orders, with a gain of approximately 20-25
percent in memory at the cost of up to approximately 7 percent loss in
compression ratio.



1 Introduction 

Originating from the idea of predicting the next symbol according to the
statistics collected from the preceding symbols, prediction-by-partial-matching
(PPM) has been a de facto gold standard in lossless text compression since its
introduction. The major drawbacks of the scheme are its huge memory consumption
and slow speed, which prevented its widespread use in practice for a long time.
For example, although the original scheme was proposed in 1984 [3], the first
practical implementation did not appear until 1990 [12], as a consequence of
the limited memory and computing power available at the time of the initial
proposal.

Many PPM variants [12,19,8,6,2,16], which improved the compression ratio
as well as the speed, have been introduced during the last three decades.
However, reducing the memory usage has received little attention, and the
advancement of technology has been assumed to be the only source of progress
in that direction.



Today, it is true that we have plenty of memory in our computers, but we also
have many memory-hungry applications demanding more space. The situation
worsens in resource-limited environments such as mobile phones and hand-held
devices, which are everywhere in today's ubiquitous computing environment.
Considering that the amount of data exchanged over wireless communication
channels is increasing at an unprecedented rate, and that users are billed
according to the number of bytes they transmit, data compression on mobile
devices is likely to become more important in the very near future. Although
PPM would be a strong option here, the significant run-time memory required by
its statistical context modeling may limit its practical impact in those
resource-limited environments.

This study investigates ways to reduce the memory consumption of PPM
via the recently proposed compressed context modeling (CCM) [10], which uses
compressed representations of contexts in the statistical model. In contrast
to the classical definition of context as the string of characters preceding
a particular position, compressed context modeling considers the context to be
the amount of previous information, that is, the bit-stream produced by
compressing the preceding symbols. We observe that with CCM the data
structures, particularly the context trees, can be implemented in smaller
space. Based on this observation we present a compression-ratio/memory
trade-off via CCM. The experiments showed that this trade-off is especially
beneficial in low orders, with a gain of approximately 20-25 percent in memory
at the cost of up to approximately 7 percent loss in compression.

The paper is organized as follows. The PPM compression scheme and several
major improvements achieved previously are reviewed in section 2. In section
3, we introduce the main idea of reducing the memory usage of PPM by
integrating compressed context modeling. The implementation of the proposed
PPMcc is described in section 4, followed by the experimental results
analyzing the trade-off between the compression ratio and the space
requirement. We conclude with a summary of the findings and future research
directions.

2 PPM and Previous Improvements 

In a Markovian sequence the appearance of a symbol at a specific position is
assumed to be highly dependent on its immediate predecessors. Cleary & Witten
[3] introduced the prediction-by-partial-matching (PPM) compression scheme
based on this principle. The basic idea in PPM is to predict the next symbol
based on its context, which is defined as the $k$ symbols preceding the
current position. Assuming text $T$ of length $n$ as $T = t_1 t_2 \ldots t_n$,
the order-$k$ context of $t_i$, $k < i \le n$, is
$t_{i-1} t_{i-2} \ldots t_{i-k}$. The first step while encoding $t_i$ in PPM
is to check whether its order-$k$ context has ever been followed by $t_i$
previously. If the pattern $t_{i-k} t_{i-k+1} \ldots t_{i-1} t_i$ has been
observed in $t_1 t_2 \ldots t_{i-1}$, the probability
$P(t_i \mid t_{i-k} t_{i-k+1} \ldots t_{i-1})$ is sent to the entropy encoder.
Otherwise, an escape symbol is emitted with the probability
$P(\mathrm{escape} \mid t_{i-k} t_{i-k+1} \ldots t_{i-1})$, which is computed
according to the zero-frequency handling of the statistical model in use, and
the context is shortened, usually by decrementing $k$ by 1. The same procedure
is repeated with the shortened context until a non-zero probability is
obtained for $t_i$ or the context length becomes zero. If the probability of
$t_i$ is still missing in the order-0 model, which means $t_i$ is a novel
symbol appearing first at position $i$, the actual raw value is encoded after
the emission of the escape.
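
The escape mechanism described above can be illustrated with a minimal
sketch. It is not the implementation evaluated in this study: it only records
which events (symbol or escape, and at which order) would be handed to the
entropy coder, and it leaves out probability estimation and exclusion. The
function name ppm_events, the "ESC" marker and the order value -1 used for a
raw novel symbol are assumptions made for illustration.

```python
def ppm_events(text, k):
    """List, per input symbol, the (order, event) pairs a PPM-style
    encoder would emit. Probabilities are deliberately not computed."""
    counts = {}   # context string -> {symbol: frequency seen so far}
    events = []
    for i, sym in enumerate(text):
        steps = []
        coded = False
        # Try the longest available context first, then shorten by one symbol.
        for order in range(min(k, i), -1, -1):
            ctx = text[i - order:i]
            if counts.get(ctx, {}).get(sym, 0) > 0:
                steps.append((order, sym))    # symbol predicted at this order
                coded = True
                break
            steps.append((order, "ESC"))      # escape to a shorter context
        if not coded:
            steps.append((-1, sym))           # novel symbol: raw value follows
        events.append(steps)
        # Update the statistics of every context order with the new symbol.
        for order in range(min(k, i) + 1):
            ctx = text[i - order:i]
            table = counts.setdefault(ctx, {})
            table[sym] = table.get(sym, 0) + 1
    return events
```

For the input "abab" with k = 1, the last symbol is predicted directly in its
order-1 context, whereas the first symbol needs an escape followed by its raw
value.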

The compression performance of PPM improves with increasing context length.
However, beyond a certain point the compression ratio begins to deteriorate.
On the Calgary corpus, it is reported that the compression degrades after
order-6 [TS]. This is due to the sparseness of the statistics, which grows
rapidly with the increasing length of the context and causes long chains of
escape symbols. Therefore, much of the effort to improve PPM has been directed
at computing the probability of a zero-frequency item, which is also a
fundamental problem in statistics [19,4,20]. In that sense, different
approaches to calculating the escape and symbol probabilities led to PPM
variants such as PPMC [12], PPMP and PPMX [15], and FastPPM [5]. The PPMII
[17,16] scheme of Shkarin introduced an information inheritance mechanism in
which the missing probabilities in long contexts are estimated from their
shorter subsequences. More recently, a lossless compression scheme based on
the sequence memoizer [7] has been introduced, which improves the compression
ratio through enhanced symbol and escape probability estimations.

Fig. 1. Context modeling techniques



Another direction of research has been to apply different context modeling
|llll) approaches, which are depicted in Figure 1. In early studies [3,12,8]
the context was assumed to be a fixed-size window of preceding symbols at a
particular position. As opposed to using such bounded-length windows, the
deterministic context [5] was proposed with PPM*. While encoding the $i$-th
character $t_i$, PPM* first investigates whether there exists a string
$t_{i-z} \ldots t_{i-1}$ of arbitrary length $z$ ($z < i$) that is always
followed by the same symbol. If such a deterministic context exists, the
encoding can be achieved very effectively. Otherwise, PPM* switches to the
classic bounded-length model. Instead of switching to a fixed-order model, the
PPMZ [2] scheme, which followed PPM*, proposed using local-order-estimation
(LOE), which decides on the order of the context at a position according to a
heuristic function. Thus, the variable-order context idea first appeared with
PPMZ. Another variant of PPM that uses variable-order modeling is PPMVC [H],
which also benefits from the PPMII idea. Note that both the unbounded-length
deterministic context investigations and the variable-order calculations
require additional computational resources and increased memory consumption.

3 Preserving Memory via Compressed Context Modeling 

On a given text $T = t_1 t_2 \ldots t_n$ of length $n$, where the individual
characters are drawn from alphabet $\Sigma$, the order-$k$ compressed context
of $t_i$ is the first $k$ bits of the preceding information, which is computed
by compressing the string $t_{i-1} t_{i-2} \ldots t_{i-\ell}$ for sufficiently
large $\ell$, $1 \le \ell < i$. Assuming $C$ is a proper compression function,
we search for the smallest $\ell$ such that the length of the bit-stream
formed by $C(t_{i-1} t_{i-2} \ldots t_{i-\ell})$ is larger than or equal to
$k$.
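
A minimal sketch of this definition is given below. It assumes that the
per-symbol code table is already available and that the codes are concatenated
in the order of the notation above, that is, starting with the code of the
most recent symbol. The function name compressed_context, the 0-based index i
and the toy code table are illustrative assumptions.

```python
def compressed_context(text, i, k, codes):
    """Return the first k bits of C(t_{i-1} t_{i-2} ... t_{i-l}) for the
    smallest l that yields at least k bits. `i` is the 0-based index of
    the symbol being predicted; `codes` maps symbols to bit strings."""
    bits = ""
    j = i - 1
    # Walk backwards over the preceding symbols until k bits are collected
    # or the beginning of the text is reached.
    while j >= 0 and len(bits) < k:
        bits += codes[text[j]]
        j -= 1
    # The code of the last symbol taken may be only partially included.
    return bits[:k]


# Hypothetical prefix-free code table, used only for this example.
toy_codes = {"a": "0", "b": "10", "c": "11"}
```

With this table, compressed_context("abcab", 4, 4, toy_codes) walks over a, c
and b (codes 0, 11 and 10) and keeps the first four bits of 01110. Near the
beginning of the text fewer than k bits may be available, in which case the
sketch simply returns the shorter bit string.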

In this work, we use standard 0th-order Huffman coding as the compression
function $C$. Obviously, the Huffman code table must also be attached to
the final compressed file, which brings an overhead. Since using a higher
order Huffman code would enlarge that overhead, the 0th-order choice is
expected to be a better fit, especially for small files such as those in the
Large Calgary corpus. One may argue that dynamic Huffman compression could be
integrated as well, in which case no additional information needs to be
carried and no initial scan over the file is needed to build the Huffman
tree. Since such dynamic approaches work better on longer files, they are not
included in this study, as we basically aim to see whether CCM can save space
in its simplest setting. Future studies are expected to investigate more
options in that sense.

The data structures used in context modeling are either hash tables [13]
or context tries. Since slow speed is one of the major problems of PPM, most
practical implementations prefer the context trie data structure to achieve
fast processing. Each node in the trie simply holds the frequency counts of
the symbols and pointers to the next nodes. Figure 2-a shows such a trie in
the classical setting. Observe that the number of pointers and the number of
frequency counts are both in the order of the alphabet size of the input data,
where a standard general-purpose implementation should consider all 256 byte
values in practice.

On the other hand, the trie structure used with CCM is sketched in Figure
2-b. As opposed to the classical $|\Sigma|$-ary context trees, the
CCM-integrated PPM implementation uses binary trees regardless of the alphabet
size, as the compressed context is by definition a bit-stream of just 0s and
1s. Therefore, at each node we do not need to reserve $|\Sigma|$ pointers, one
for every character of the alphabet, but only two pointers for the next bit.

Fig. 2. Context trie structures in classical and compressed context modeling:
a) Classical Context Trie, b) Compressed Context Trie
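
The difference between the two node layouts can be sketched as follows. The
class names ClassicNode and CCMNode and the helper update are hypothetical.
The sketch also assumes that the statistics are recorded only at the node
reached by the full compressed context, which the text above does not specify.

```python
class ClassicNode:
    """Node of a classical context trie: one counter and one child
    pointer per alphabet symbol."""
    def __init__(self, sigma):
        self.counts = [0] * sigma
        self.children = [None] * sigma


class CCMNode:
    """Node of a compressed context trie: still one counter per symbol,
    but only two child pointers, one per bit."""
    def __init__(self, sigma):
        self.counts = [0] * sigma
        self.children = [None, None]


def update(root, context_bits, symbol_index):
    """Follow (and create when missing) the path spelled by the context
    bits, then count the symbol at the node that is reached."""
    sigma = len(root.counts)
    node = root
    for b in context_bits:
        bit = int(b)
        if node.children[bit] is None:
            node.children[bit] = CCMNode(sigma)
        node = node.children[bit]
    node.counts[symbol_index] += 1
    return node
```

For a 256-symbol alphabet a ClassicNode reserves 256 child slots, whereas a
CCMNode reserves two, independent of the alphabet size.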

The gain in space via CCM reported in this study stems from this reduction
in the size of the nodes in the context trie. Note that it is possible to
apply dynamic data allocation tricks to use less space in the trie
structures. However, such attempts slow down execution, and hence are
contrary to the goal of achieving fast compression (decompression). Thus, we
do not consider this kind of programming practice.



4 Implementation Details 

We have implemented the basic PPM (following Moffat's study [12]) and its
proposed variant PPMcc with compressed context modeling. For computing the
escape and symbol probabilities in both implementations, we used the method
proposed by Howard & Vitter [9], as this technique consistently performed
better than the other alternatives. For entropy coding/decoding, we preferred
the FastAC¹ arithmetic coder of Said [14].

1 Available at http://www.cipr.rpi.edu/~said/FastAC.html 

PPMcc replaces the classical context modeling in the basic PPM with CCM.
In order to compute the compressed context of a position via 0th-order
Huffman coding, we first make an initial pass over the input file and
generate the corresponding Huffman tree. The probability of the next symbol
is estimated according to the first $k$ bits of the preceding compressed
context instead of the most recent symbols as in classical PPM.

A major difference between PPMcc and PPM appears when decreasing the context
length after an escape symbol is emitted. Shortening the compressed context by
one bit is not appropriate, as it causes long chains of escape sequences.
Thus, after emitting an escape, we move up in the context trie in steps of a
fixed number of bits, which we refer to as the pitch size throughout the
study. To be compatible with the classical solution of decrementing the
context by one symbol, in PPMcc the pitch size is taken to be the average code
length of the file, which is computed during the generation of the 0th-order
Huffman codes.
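
The pitch size and the resulting sequence of context lengths can be sketched
as below. The sketch assumes that the average code length is weighted by
symbol frequency and rounded to the nearest integer, and that the context
falls back to length 0 once fewer bits than one pitch remain. The function
names pitch_size and context_lengths are illustrative.

```python
import math
from collections import Counter


def pitch_size(text, codes):
    """Average code length in bits per symbol, weighted by symbol
    frequency and rounded to the nearest integer."""
    freq = Counter(text)
    total_bits = sum(freq[s] * len(codes[s]) for s in freq)
    return math.floor(total_bits / len(text) + 0.5)


def context_lengths(k, pitch):
    """Context lengths (in bits) tried in turn after successive escapes,
    starting from k bits and ending with the empty context."""
    lengths = []
    while k > 0:
        lengths.append(k)
        k -= pitch
    lengths.append(0)
    return lengths
```

With a pitch size of 5, for example, a 14-bit compressed context would be
shortened to 9 bits, then to 4 bits, and finally to the empty context.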



5 Experimental Results 

Experiments were conducted on the Large Calgary corpus to measure the
performance of the CCM-based implementation PPMcc against the standard PPM.



Fig. 3. The compression ratios of the standard PPM and the proposed PPMcc in
bits/symbol for different context lengths (plot title: Average PPM compression
on Large Calgary Corpus).

Figure 3 shows the compression ratios achieved by the classical PPM and
PPMcc. The pitch sizes used by PPMcc are 6 for the geo, obj1, obj2 and trans
files, 2 for the pic file, and 5 for all the rest. The main concern of this
study is to reduce the memory consumption, and thus trading compression ratio
for space is anticipated. In general, the competitiveness attained by PPMcc
in low orders could not be sustained in higher orders due to the increase in
escape character emissions, as indicated in Figure 4, which plots the number
of escape characters emitted per symbol.





Fig. 4. Average number of escape characters emitted per symbol in PPM and PPMcc.



The main factor affecting the compression performance is the ambiguity
introduced in CCM, which is more dominant in high dimensions. The Huffman
codes are prefix-free, and hence uniquely decipherable, which means the
Huffman code generated for a symbol cannot be a prefix of another one.
However, the codes of two distinct characters can share a common prefix. When
the length of $C(t_{i-1} t_{i-2} \ldots t_{i-\ell})$ is longer than $k$, the
Huffman code of $t_{i-\ell}$ can only be partially included in the compressed
context bit-stream, which might cause an ambiguity. As an example, assume that
the Huffman codes of letters m and n are 1101100 and 111010000, respectively.
If the last symbol $t_{i-\ell}$ in a sample context is m and there is only a
two-bit vacancy in the compressed context, then we will append 11 to the
bit-stream, which is also the first two bits of the code of n. In such a case
the first $k$ bits of $C(t_{i-1} t_{i-2} \ldots m)$ and
$C(t_{i-1} t_{i-2} \ldots n)$ will be equal, and hence an ambiguity will arise
that eventually weakens the statistics.
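
The example above can be reproduced with a few lines of code. The two codes
are the ones given in the text, whereas the 8-bit prefix standing for the
already compressed part of the context and the helper name fill_context are
arbitrary assumptions.

```python
def fill_context(prefix_bits, last_code, k):
    """Append the code of the last context symbol to the bits collected
    so far and truncate the result to k bits."""
    return (prefix_bits + last_code)[:k]


CODE_M = "1101100"     # Huffman code of m, as given in the text
CODE_N = "111010000"   # Huffman code of n, as given in the text

prefix = "01011010"    # arbitrary 8 bits already in the context
k = 10                 # leaves a two-bit vacancy for the last symbol

context_with_m = fill_context(prefix, CODE_M, k)
context_with_n = fill_context(prefix, CODE_N, k)
```

Both calls keep only the leading 11 of the last code, so context_with_m and
context_with_n are the same 10-bit string and the two contexts share their
statistics.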

A similar problem is caused by rare characters, which have relatively long
Huffman codes. Let us assume we are using a 10-bit compressed context model,
where $t_{i-1}$ is j with a 19-bit code. This means we can only use the
partial information of j stored in its initial 10 bits. When this 10-bit
sequence is also a prefix of another symbol's code, the statistics of those
contexts are merged, which in turn causes inefficiency in entropy coding.



              PPM                        PPMcc                   % of gain in
order  Number of  bits per    order  Normalized #   bits per   memory  compression
       Nodes      symbol             of Nodes (y)   symbol     space   ratio
  1         83     3.603        6           65       3.523     21.69      2.23
  2       1909     2.907       10         1017       2.860     46.73      1.60
  3      15205     2.474       14        10612       2.482     30.21     -0.33
  4      65161     2.323       18        61910       2.365      4.99     -1.81
  5     189280     2.325       21       168448       2.369     11.01     -1.91
  6     417272     2.367       24       369295       2.406     11.50     -1.63

Table 1. Analysis of file book1 in experiment 1.



The memory usage of the proposed scheme is compared against the classical
PPM by comparing the number of nodes in the corresponding context tries, as
this mainly determines the actual space usage. A standard node in an ordinary
context trie has $|\Sigma|$ integers and $|\Sigma|$ pointers, whereas there
are $|\Sigma|$ integers and 2 pointers in the CCM tries, as indicated
previously in Figure 2. Hence, a PPMcc tree with $x$ nodes occupies space
equal to an ordinary tree with $y = x \cdot \frac{|\Sigma|+2}{2|\Sigma|}$
nodes, assuming that the integer and pointer types are of the same size. We
refer to $y$ as the normalized number of nodes in the CCM trie.

On each file of the corpus we compare order-$\ell$ PPMcc against order-$k$
PPM, where $\ell$ is the largest number such that the size of the CCM trie is
less than the size of the classic trie. A sample analysis performed on file
book1 is given in Table 1. The alphabet of book1 consists of 82 characters.
Thus, while computing the normalized node count $y$, we multiply the number of
nodes in the CCM trie by $\frac{82+2}{2 \cdot 82} \approx 0.51$. The
compressed context is decremented in steps of 5 bits, which is the rounded
average code length of 4.56 bits measured according to 0th-order Huffman
encoding. The summary of this analysis performed over all files of the Large
Calgary corpus is given in Table 2. The best trade-offs achieved are marked in
bold throughout the table.
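
The normalization and the memory gain reported for book1 can be checked with
the following small sketch. The function names are illustrative, and the node
counts are taken from Table 1.

```python
def normalization_factor(sigma):
    """Size of a CCM node (sigma integers + 2 pointers) relative to a
    classical node (sigma integers + sigma pointers), assuming integers
    and pointers have the same size."""
    return (sigma + 2) / (2 * sigma)


def memory_gain_percent(ppm_nodes, normalized_ccm_nodes):
    """Percentage of space saved by the CCM trie against the classical one."""
    return 100.0 * (ppm_nodes - normalized_ccm_nodes) / ppm_nodes


factor_book1 = round(normalization_factor(82), 2)       # alphabet of book1
gain_order1 = round(memory_gain_percent(83, 65), 2)     # order-1 row
gain_order2 = round(memory_gain_percent(1909, 1017), 2) # order-2 row
```

The three values evaluate to 0.51, 21.69 and 46.73, which agree with the
factor quoted above and with the memory space column of Table 1.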

Careful readers will have noticed that although the size of an individual
node is reduced, the total number of nodes is much larger when CCM is used,
as the CCM tree is much less sparse than the tree in classical context
modeling. Thus, the advantage of using small nodes diminishes with the
increasing depth of the tree in the CCM-based implementation. The experiments
comply with this observation: CCM integration is particularly beneficial in
low orders of PPM. The trade-off between the compression ratio and the memory
consumption is depicted in Figure 5.

% of gain in memory (M) and compression ratio (C)

          order-1         order-2         order-3         order-4         order-5         order-6
file      M       C       M       C       M       C       M       C       M       C       M       C
bib     23.17   -2.19    5.89    1.03    0.38   -5.69    8.58   -7.38    6.77   -7.28    4.44   -6.42
book1   21.69    2.23   46.73    1.60   30.21   -0.33    4.99   -1.81   11.01   -1.91   11.50   -1.63
book2   34.02    0.34   38.45    5.17   17.10    0.19   19.02   -3.86   15.95   -4.37   12.64   -3.93
geo      0.78    1.39   28.72   -1.78   14.60   -4.18   10.29   -5.84   14.95   -7.01    0.42   -7.48
news    35.35   -1.24   10.86    5.09   33.35   -3.83    2.88   -5.65    1.32   -5.85   14.47   -5.41
obj1     1.95   -4.10    4.72  -11.38   24.00  -13.49    7.71  -14.69   11.47  -14.50    1.75  -14.37
obj2     0.39    3.17   10.87    1.00   19.36   -5.31   16.18   -8.61   14.24   -9.25   12.14   -8.76
paper1  33.33   -0.80   42.74   -4.99    9.58   -6.89    6.96   -8.12    3.26   -7.53   14.86   -7.26
paper2  30.43    0.88   33.94   -0.33   31.87   -5.63   23.95   -5.94   16.78   -5.00   11.03   -4.56
paper3  23.53   -0.84   28.73   -1.48   27.59   -7.05   19.05   -7.36   10.97   -6.60    4.74   -6.21
paper4  19.75   -3.11    8.37   -5.92    7.23  -11.29   23.31  -10.95   11.08  -10.54    2.39  -10.52
paper5  30.43   -5.19   19.88  -10.01   12.29  -15.19    2.26  -14.19   12.68  -14.16    4.43  -13.99
paper6  31.91   -1.98   39.55   -4.9    73.95   -8.62    1.91   -8.93   17.03   -8.67   10.76   -8.27
pic     19.38    3.55   38.56    1.30    9.93   -3.98   24.36   -6.63   21.60   -9.75   14.61  -10.98
progc   31.18   -2.37    5.27   -3.28   11.66   -9.46    7.82   -9.69    3.66   -9.10    0.44   -9.20
progl   27.27    2.08   34.11   -3.17    1.02   -6.34    0.83   -7.80   17.05   -7.94   10.28   -8.00
progp   28.89   -0.07   41.22   -8.85   10.72  -10.50    9.61   -9.78    6.51   -9.68    3.28   -9.48
trans   36.00   -3.15   23.46   -4.00   29.07  -14.36    1.32   -9.56    1.11   -9.01    0.16   -8.64
AVG.    23.86   -0.63   25.67   -2.49   20.22   -7.33   10.61   -8.16   10.97   -8.23    7.46   -8.06
MAX.    35.35    3.17   46.73    5.17   73.95    0.19   24.36   -1.81   21.60   -1.91   14.86  -14.37
MIN.     0.39   -5.19    4.72  -11.38    1.02  -15.19    0.83  -14.69    1.11  -14.50    0.16   -1.63

Table 2. PPMcc results.



Fig. 5. The trade-off accomplished between compression ratio and memory
consumption (chart title: Trading space versus compression ratio with PPMcc;
series: % of gain in space, % of gain in compression ratio; horizontal axis:
PPM order in symbols).



6 Conclusions 

We have presented a technique to be used in PPM implementations, namely
PPMcc, based on compressed context modeling with the aim of using less
space throughout the compression. The investigation of the trade-off between
memory consumption and compression ratio showed that PPMcc is beneficial
especially in low orders. The results given in Table 2 show that, in those
orders, the gain in space is much larger than the loss in compression ratio.
It is noteworthy that a space improvement in low orders is of particular
importance, as it may help the integration of PPM-style compressors into
mobile environments with limited resources.

Future studies may consider using different encoding techniques for
compressing the context, such as higher-order Huffman codes rather than the
0th-order codes used in this work. Methods to decrease the escape symbol
emission rates and to improve the prediction power with CCM will be
significant for achieving better compression in less space. Instead of
integrating CCM into fixed-length models, using compressed versions of the
other context modeling techniques might also make sense. Studies
investigating the use of CCM in other PPM variants may be interesting as well.